Multimodal reasoning based on knowledge graph embedding for specific diseases

Abstract
Motivation: Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some models add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases.
Results: This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, an SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Disease11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs using the entity prediction task as the evaluation protocol. We verify the model’s reliability in discovering new knowledge by manually proofreading predicted drug–gene, gene–disease and disease–drug pairs. Using embedding results as initialization parameters for biomolecular interaction classification, we demonstrate the universality of embedding models.
Availability and implementation: The constructed SDKG-11 and the implementation by TensorFlow are available from https://github.com/ZhuChaoY/SDKG-11.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Knowledge Graph (KG) is a way to store knowledge and reveal the dynamic development law of a field. A KG represents facts in the real world through a large number of triplets (head entity, relation and tail entity), denoted as $(h, r, t)$. Large integrated KGs like Freebase (Bollacker, 2008) and DBpedia (Lehmann et al., 2015) keep expanding. They have been successfully used in many applications, such as recommendation systems (Wang et al., 2019a) and question answering (Huang et al., 2019).
In the field of biomedicine, the application of KGs is becoming increasingly popular because they capture specialized knowledge that only domain experts can understand well (Mohamed et al., 2021). The influential roles of KGs in predicting protein drug targets (Mohamed et al., 2020) and adverse drug reactions are both convincing examples. The triplets of biomedical KGs can be manually filled out by experts or automatically extracted from Electronic Medical Records (EMRs) and literature (Li et al., 2020). The former is labor-intensive for large-scale KGs, while the latter is becoming more effective thanks to the rapid improvement of natural language processing.
Most existing biomedical KGs focus on particular subfields, such as DrugBank (Wishart et al., 2018) for drugs and UniProt (UniProt Consortium, 2019) for proteins. However, these subfields are divided at the entity level, and few KGs focus on specific diseases. A Specific Disease Knowledge Graph (SDKG) mainly focuses on the knowledge of a particular disease, and can therefore play a more specialized role in guiding the study of the causes, treatments and prognoses of the disease. Recently, in response to COVID-19, over three associated SDKGs have been constructed for drug repurposing (Al-Saleem et al., 2021; Che et al., 2021). A chronic obstructive pulmonary disease (COPD) SDKG was established to assist in diagnosing early curable stage COPD (Fang et al., 2019). There was also a melanoma SDKG built to support precision medicine (Kang et al., 2020). Considering that it is hard to access all the knowledge from literature (e.g. all PubMed abstracts), limiting the scope to several diseases allows for a greater concentration of valid information. We consider 11 diseases in this work, named SDKG-11, including 5 cancers (colon cancer, gallbladder cancer, gastric cancer, liver cancer and lung cancer) and 6 non-cancer diseases (Alzheimer's disease, COPD, coronary heart disease, diabetes, heart failure and rheumatoid arthritis). The morbidity and mortality of these diseases are high, seriously threatening people's lives (WHO, 2016). In particular, lung cancer, colon cancer, liver cancer and gastric cancer ranked as the top four in global cancer mortality in 2020 (Sung et al., 2021).
Since new biomedical knowledge is being presented every day, almost all constructed biomedical KGs are incomplete (Nickel et al., 2016). In addition to the methods mentioned above, new knowledge can also be inferred from existing knowledge. Knowledge Graph Embedding (KGE) has recently emerged as a paradigm for KG reasoning (Alshahrani et al., 2021; Wang et al., 2017). KGE maps entities and relations into a low-dimensional vector space and uses simple mathematical calculations instead of an explicitly defined reasoning process, which vastly improves computational efficiency. A KGE model defines a scoring function $f(h, r, t)$ to measure the probability that a triplet exists (Bordes et al., 2013). To improve reasoning effectiveness, some models aim to strengthen the expressive ability of the scoring function (Nguyen et al., 2018; Wang et al., 2014), and some multimodal models add additional information, such as categories (Xie et al., 2016) and descriptions (Nie and Sun, 2019).
This study proposes a complete SDKG construction and multimodal reasoning process. Firstly, we constructed the original SDKGs from biomedical literature. Secondly, we built the standard entity set from specialized biomedical databases. Then, we refined the original SDKGs by entity linking and relation linking to obtain SDKG-11. Finally, we reasoned on the SDKGs with a multimodal KGE model. To verify the reliability of the inferred knowledge, we manually proofread the predicted drug-gene, gene-disease and disease-drug pairs. To demonstrate the universality of the embedding results, we used them as pre-trained knowledge for biomolecular interaction classification.

Original triplet extraction
Based on the aliases of the 11 selected diseases (Supplementary Appendix S1), we extract triplets from the titles, running titles, keywords, abstracts and conclusions of PubMed-indexed literature published between 1980 and 2020. We only consider journals with a 2020 impact factor of no less than 2.0.
Triplet extraction has two main steps: Named Entity Recognition (NER) (Wang et al., 2019b) and Relation Extraction (RE) (Sangrak and Kang, 2018). NER identifies biomedical entities in literature texts, and we use Att-BiLSTM-CRF (Luo et al., 2018) to accomplish that. RE extracts the relations among the entities identified by NER, and is performed by a combination of BiLSTM (Kiperwasser and Goldberg, 2016) and ResNet (He et al., 2016).
We integrate all the extracted triplets of each specific disease into an original SDKG, combine the five cancers into a Cancer5 KG, and build a Disease11 KG consisting of all the 11 diseases. In addition, we record the complete sentence from which each original triplet was extracted for subsequent processing. Figure 1 is the flow chart for this section.

Standard entity set
Biomedical entity names are easily plagued by synonymy and polysemy, which increase the unreliability and redundancy of a KG. As an example of the former, 'HCC' and 'liver cancer' may both mean the entity 'hepatocellular carcinoma'. This step builds a standard entity set containing all the synonyms so that mentions of the same entity point to a unique node. As an example of the latter, besides denoting a disease, 'HCC' also corresponds to two genes, a phenotype and a small molecule (Fig. 2). This is addressed by entity disambiguation in Section 2.3.1.
Category annotations are particular to each entity type. For example, the category annotation of a protein includes its status ('Experimental evidence at protein level', etc.) and its Gene Ontology (The Gene Ontology Consortium, 2019) annotations. Besides, the entity type (gene, disease, etc.) is treated as a particular category. The detailed category annotations are provided in Supplementary Appendix S2.
Description annotations consist of multiple text contents concatenated in order of importance. For example, the description annotation of a disease is composed of its summary, clinical features, molecular genetics, mapping and inheritance texts. As some descriptions are empty strings, we use the concatenation of all synonyms instead. The detailed description annotations are provided in Supplementary Appendix S3.

Entity linking
Original triplets extracted from literature should be linked to the standard entity set. Firstly, we perform entity normalization for the original entities and for the synonyms in the standard entity set, including outlier screening, token stemming and token reordering. Then, we link original entities to the standard entity set by exact string matching. Since a synonym may appear in multiple standard entities, we build an end-to-end entity disambiguation model to select the most suitable standard entity for 1-to-N mapping pairs.
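The normalization and exact-match linking step can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the function names (`normalize`, `build_index`, `link`) are hypothetical, the plural-stripping rule is a toy stand-in for a real stemmer, and outlier screening is omitted.

```python
from __future__ import annotations

import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, tokenize, crudely stem and reorder tokens (illustrative only)."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Toy stemming (assumption): strip a trailing plural "s" from longer tokens.
    stems = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    # Token reordering: sort so that word order does not affect matching.
    return " ".join(sorted(stems))


def build_index(standard_entities: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized synonym to the standard entity IDs that carry it."""
    index: dict[str, set[str]] = defaultdict(set)
    for entity_id, synonyms in standard_entities.items():
        for synonym in synonyms:
            index[normalize(synonym)].add(entity_id)
    return index


def link(mention: str, index: dict[str, set[str]]) -> set[str]:
    """Return all standard entity IDs whose normalized synonym equals the mention."""
    return set(index.get(normalize(mention), ()))
```

A mention that returns exactly one ID is a 1-to-1 mapping; a mention that returns several IDs (such as 'HCC') is a 1-to-N mapping that would be passed on to the disambiguation model.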
Using the contexts of original entities (i.e. complete sentences) and the contexts of standard entities (i.e. description annotations) as inputs, the disambiguation model outputs a matching score through an encoder and a fully connected layer. We use all 1-to-1 mapping pairs as the positive set and generate a negative set of equal size. The combination of these two sets is then randomly divided into a training set (90%) and a validation set (10%).
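The construction of the disambiguation data can be sketched as below. The text does not state how negative pairs are generated, so this sketch assumes one plausible scheme: each positive context is paired with a different, randomly drawn standard entity. The function name and arguments are hypothetical.

```python
import random


def build_disambiguation_sets(positive_pairs, all_entity_ids, train_frac=0.9, seed=0):
    """Build labeled (context, entity_id, label) examples and split them 90/10.

    positive_pairs: list of (mention_context, entity_id) from 1-to-1 mappings.
    all_entity_ids: list of standard entity IDs (needs at least two distinct IDs).
    """
    rng = random.Random(seed)
    examples = []
    for context, entity_id in positive_pairs:
        examples.append((context, entity_id, 1))
        # Assumed negative sampling: same context, a different random entity.
        wrong_id = rng.choice([e for e in all_entity_ids if e != entity_id])
        examples.append((context, wrong_id, 0))
    rng.shuffle(examples)
    cut = int(train_frac * len(examples))
    return examples[:cut], examples[cut:]
```

Drawing one negative per positive keeps the two classes balanced, which matches the equal-size negative set described above.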
We apply the pre-trained language model BERT (Devlin et al., 2019) and its biomedical version BioBERT (Lee et al., 2019) as encoders, respectively. Both are fine-tuned for 10 epochs, and their accuracies are checked on the validation set. Figure 2 shows the structure and an example of this step. More information about the pre-trained language models and the complete training details are provided in Supplementary Appendices S4 and S5.

Relation linking
The main problems of biomedical relations are noise and synonymy, so we perform relation normalization for all original relations. Besides outlier screening and token stemming, we also apply part-of-speech tagging and keep only nouns, verbs and adverbs. Unlike entities, relations are not numerous and lack a standard set to match against. Therefore, we manually build a mapping dictionary for frequently occurring relations with reference to an existing relation hierarchy structure (Zhao et al., 2019). For instance, the relations 'related' and 'correlated' are both represented by the relation 'associate'.

Knowledge graph embedding
We build the multimodal reasoning model from three parts: structure embedding (S), category embedding (C) and description embedding (D). We randomly divide each SDKG into a training set (80%), a validation set (10%) and a test set (10%), ensuring that all entities and relations appear in the training set.
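One way to obtain such a split is sketched below. It is an illustrative assumption, not the procedure used in this work: triplets are shuffled, any triplet that introduces a not-yet-seen entity or relation is forced into the training set, and only the remaining triplets are eligible for validation and test.

```python
import random


def split_triplets(triplets, valid_frac=0.1, test_frac=0.1, seed=0):
    """Split (h, r, t) triplets so every entity and relation occurs in training."""
    rng = random.Random(seed)
    shuffled = list(triplets)
    rng.shuffle(shuffled)

    train, movable = [], []
    seen_entities, seen_relations = set(), set()
    for h, r, t in shuffled:
        if h in seen_entities and t in seen_entities and r in seen_relations:
            # Everything in this triplet is already covered by the training set.
            movable.append((h, r, t))
        else:
            train.append((h, r, t))
            seen_entities.update((h, t))
            seen_relations.add(r)

    n_valid = min(int(valid_frac * len(shuffled)), len(movable))
    n_test = min(int(test_frac * len(shuffled)), len(movable) - n_valid)
    valid = movable[:n_valid]
    test = movable[n_valid:n_valid + n_test]
    train.extend(movable[n_valid + n_test:])  # leftovers go back to training
    return train, valid, test
```

On very sparse graphs the held-out sets may come out smaller than 10% each, because too few triplets are free to leave the training set.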
TransE assumes that if a triplet exists, its vector representations should satisfy $h + r \approx t$. Its structure embedding is defined as:
$$S_{\mathrm{TransE}}(h, r, t) = h + r - t.$$
Despite its simplicity and efficiency, TransE cannot handle non-1-to-1 relations. TransH introduces a hyperplane normal vector $w_r$ for each relation $r$, so that each entity has a different vector representation under different relations. ConvKB incorporates the TransE principle $h + r - t$, and its convolution operation makes the model more parameter-efficient. Their structure embeddings are defined as:
$$S_{\mathrm{TransH}}(h, r, t) = (h - w_r^{\top} h\, w_r) + r - (t - w_r^{\top} t\, w_r),$$
$$S_{\mathrm{ConvKB}}(h, r, t) = \mathrm{ReLU}([h, r, t] * \Omega),$$
where the matrix $[h, r, t]$ is the concatenation of a triplet, $*$ is the convolution operator, $\Omega$ is the concatenation of filters (initialized as the $1 \times 3$ vector $[0.1, 0.1, -0.1]$) and $\mathrm{ReLU}(x) = \max(x, 0)$.
Note: The last two columns indicate the number of category annotations and non-empty character percentage of description annotations for each entity type.
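The two translation-based structure embeddings can be sketched in NumPy as follows. The function names are hypothetical; normalizing $w_r$ to unit length follows the usual TransH constraint (Wang et al., 2014) and is an assumption here, as is the use of the L2 norm as the energy of a triplet.

```python
import numpy as np


def s_transe(h, r, t):
    """TransE structure embedding: the residual vector h + r - t."""
    return h + r - t


def s_transh(h, r, t, w_r):
    """TransH structure embedding: project h and t onto the hyperplane of r."""
    w_r = w_r / np.linalg.norm(w_r)      # assumed unit-norm constraint on w_r
    h_proj = h - np.dot(w_r, h) * w_r    # h minus its component along w_r
    t_proj = t - np.dot(w_r, t) * w_r    # t minus its component along w_r
    return h_proj + r - t_proj


def energy(s):
    """Assumed energy: L2 norm of the structure embedding (lower = more plausible)."""
    return float(np.linalg.norm(s))
```

Under TransH, two entities that differ only along $w_r$ become identical after projection, which is what lets one head map to several tails under the same relation.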

Category embedding
We first randomly initialize a category embedding matrix with the same dimension as the structure embedding, which will be learned jointly with the structure embeddings. Then we take the mean of the embedding vectors of all categories of entity $e$ as its category embedding:
where $e_c$ is the category set of $e$. The category embedding of a triplet is defined as:
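The entity-level averaging described in this section can be sketched as follows. The function name and the integer category indices are hypothetical; the triplet-level combination is not shown because it cannot be recovered from the text.

```python
import numpy as np


def category_embedding(category_matrix, category_ids):
    """Mean of the embedding rows of an entity's categories.

    category_matrix: array of shape (num_categories, k), learned jointly.
    category_ids: indices of the categories annotated for one entity; assumed
    non-empty, since the entity type itself is treated as a category.
    """
    return category_matrix[np.asarray(category_ids, dtype=int)].mean(axis=0)
```

For example, an entity annotated with categories 0 and 2 receives the element-wise average of rows 0 and 2 of the matrix.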

Description embedding
We use BioBERT to convert description annotations into computable vectors. The description embedding of entity $e$ is defined as:
where $e_d$ is the description annotation of $e$, and $W_D$ is a weight matrix. We train the description embeddings of all entities in advance with 10 fine-tuning epochs and fix them as feature inputs. The description embedding of a triplet is defined as:

Multimodal learning
Cross-embedding (Tang et al., 2019; Xie et al., 2016) and hyperplane projection (HP) (Guan et al., 2019; Xiao et al., 2017) are two traditional multimodal learning methods. The cross-embedding scoring function for TransE is defined as:
and similar formulas for TransH and ConvKB. In the HP method, category and description embeddings are regarded as two normal vectors (Fig. 3A). However, we believe that structure embedding should be the core part of multimodal learning, since it contains essential knowledge from literature. Hence, we propose reverse-hyperplane projection (reverse-HP), which regards the structure embedding as the hyperplane (Fig. 3B). On the one hand, we want to minimize the norm of the structure embedding vector, which is consistent with the original intention of structure embedding; on the other hand, we wish to maximize the projections of the category and description embeddings on the structure hyperplane, so as to fully extract the meanings of the annotations. The results of this work are based on reverse-HP after comparison (Fig. 4). For TransE and TransH, the final scoring functions of reverse-HP are defined as:
where $S^{*} = S / \|S\|_2^2$, and $\lambda_C$ and $\lambda_D$ are weight parameters. The loss function is defined as:
where $\gamma$ is a margin hyper-parameter, $G^{+}$ represents the positive triplet set and $G^{-}$ represents the negative triplet set artificially generated by the Bernoulli trick (Wang et al., 2014). For ConvKB, the final scoring function is defined as:
where $W_S$ is a weight matrix and $\mathrm{vec}(\cdot)$ transforms a matrix into a vector with the same elements. The loss function is defined as:
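The negative sampling and the margin term referred to above can be sketched as follows. The Bernoulli trick follows Wang et al. (2014): for each relation, the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail. The hinge form of the loss is the common margin-based ranking loss for translation models and is only an assumption about the equation used here; all function names are hypothetical.

```python
import random
from collections import defaultdict


def head_replace_probability(triplets):
    """Per relation, probability of corrupting the head: tph / (tph + hpt)."""
    tails_per_head = defaultdict(set)  # (relation, head) -> set of tails
    heads_per_tail = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triplets:
        tails_per_head[(r, h)].add(t)
        heads_per_tail[(r, t)].add(h)

    def mean_size_by_relation(groups):
        sizes = defaultdict(list)
        for (r, _), members in groups.items():
            sizes[r].append(len(members))
        return {r: sum(v) / len(v) for r, v in sizes.items()}

    tph = mean_size_by_relation(tails_per_head)
    hpt = mean_size_by_relation(heads_per_tail)
    return {r: tph[r] / (tph[r] + hpt[r]) for r in tph}


def corrupt(triplet, entities, prob, positives, rng):
    """Draw one negative triplet that is not a known positive.

    Loops until a valid negative is found, so `entities` must be large
    enough for at least one corrupted triplet to fall outside `positives`.
    """
    h, r, t = triplet
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < prob[r] else (h, r, e)
        if candidate not in positives:
            return candidate


def margin_loss(pos_energy, neg_energy, gamma):
    """Assumed hinge loss: push positives at least gamma below negatives."""
    return max(0.0, gamma + pos_energy - neg_energy)
```

For a one-to-many relation, tph exceeds hpt, so the head is corrupted more often; this lowers the chance of generating a false negative by swapping in another valid tail.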

Experimental setup
We use the Adam optimizer (Kingma and Ba, 2014) to minimize the loss function on the training set, find the optimal hyper-parameters by grid search on the validation set and evaluate the model on the test set. The search ranges of the hyper-parameters are: embedding size $k \in \{100, 200\}$, margin $\gamma \in \{0.2, 0.6, 1.0\}$, number of filters $n_f \in \{10, 20, 30\}$, learning rate $l_r \in \{5 \times 10^{-4}, 10^{-3}, 5 \times 10^{-3}\}$ and weight $\lambda \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$. In addition, we fix the batch size to 1/40 of the training set size, and the maximum number of training epochs is 1000. All the embeddings are initialized by Glorot initialization (Glorot and Bengio, 2010) with the boundary $(-\sqrt{6/k}, \sqrt{6/k})$. The evaluation protocol of KGE models is the entity prediction task, namely, given an entity and a relation, predict the other entity. Our evaluation metric is the mean rank (MR) of the correct answers. Note that a lower MR means better performance, and we only consider the 'filter' setting (Wang et al., 2014). We also evaluate on PharmKG (Zheng et al., 2021), a dedicated KG benchmark for biomedical data mining. Since there are no entity category and description annotations for PharmKG, we annotate the part that overlaps with the standard entity set and fill the rest with null values.
In experiments, we consider the following four configurations:
1. S stands for using structure embedding only ($\lambda_C = 0$, $\lambda_D = 0$).
2. S + C stands for using structure and category embeddings ($\lambda_C \neq 0$, $\lambda_D = 0$).
3. S + D stands for using structure and description embeddings ($\lambda_C = 0$, $\lambda_D \neq 0$).
4. S + C + D stands for using all three embeddings simultaneously ($\lambda_C \neq 0$, $\lambda_D \neq 0$).
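The filtered mean rank can be sketched as follows. The sketch assumes that each candidate entity receives an energy-like score where lower is better; if the scoring function is oriented the other way, the comparison must be flipped. The function names are hypothetical.

```python
import numpy as np


def filtered_rank(scores, true_idx, known_idx):
    """Rank of the true entity after removing other known-correct candidates.

    scores: one score per candidate entity (assumed: lower is better).
    true_idx: index of the entity that completes the test triplet.
    known_idx: indices of entities that also form known triplets with the
    same (entity, relation) query; these are filtered out of the ranking.
    """
    scores = np.asarray(scores, dtype=float).copy()
    mask = np.array([i for i in known_idx if i != true_idx], dtype=int)
    scores[mask] = np.inf  # filtered candidates can never outrank the answer
    return 1 + int(np.sum(scores < scores[true_idx]))


def mean_rank(ranks):
    """Mean rank (MR) over all test queries."""
    return float(np.mean(ranks))
```

Without filtering, a model would be penalized for ranking another correct answer above the test answer, which is why only the 'filter' setting is reported.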

Statistical superiority test
To test the superiority of the selected model and configuration, we perform a multivariate Analysis of Variance (ANOVA) on the score rank. Due to the non-normal rank distribution, we employ the robust ANOVA based on median and median-of-means estimations using the R package WRS2 (Mair and Wilcox, 2020). Further, one-tailed Wilcoxon tests are used as post hoc tests from another perspective.

Reliability of inferred knowledge
We perform the KG completion task on drug-gene, gene-disease and disease-drug pairs. For every possible pair, we calculate its score with the trained scoring function on each SDKG (all relations are substituted in and the highest score is retained). We take the top-scored inferred items (not in the training set), up to 10% of the training set size, as reliable new inferred knowledge. Then, they are combined with existent knowledge to construct comprehensive networks, which are visualized with Cytoscape (Su et al., 2014). We further compare the disease-gene prediction results with advanced network-based methods: LINE (Tang et al., 2015), Node2vec (Grover and Leskovec, 2016) and HerGePred (Yang et al., 2019). Association Precision (AP) is used as the evaluation metric:
where $D$ is the test disease set, $T(d)$ represents the test gene set of disease $d$ and $P(d)$ is the set of top-$|T(d)|$ predicted genes. In terms of clinical significance, we are especially interested in new inferred disease-drug pairs with potential clinical applications. We perform co-clustering of the comprehensive disease-drug pairs with CoClust (Role et al., 2019), a Python package based on k-means clustering for binary (0/1) variables. We set the number of clusters on each side of the co-clustering to two, because diseases can be divided into cancerous and non-cancerous, and the same holds for drugs.
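Since the AP equation itself is not reproduced here, the following sketch implements one natural reading of the symbol definitions: for each test disease, the fraction of its test genes recovered among the top $|T(d)|$ predictions, averaged over all test diseases. This reading, the function name and the input layout are assumptions.

```python
def association_precision(test_genes, ranked_predictions):
    """Average over diseases of |T(d) & P(d)| / |T(d)| (assumed form of AP).

    test_genes: dict mapping disease -> non-empty set of held-out genes, T(d).
    ranked_predictions: dict mapping disease -> genes ordered best first.
    """
    total = 0.0
    for disease, true_genes in test_genes.items():
        k = len(true_genes)
        top_k = set(ranked_predictions[disease][:k])  # P(d)
        total += len(top_k & true_genes) / k
    return total / len(test_genes)
```

Because the cut-off equals the number of test genes, a perfect ranking yields an AP of 1 regardless of how many genes each disease has.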

Universality of embedding models
We use embedding results as initialization parameters for the biomolecular interaction classification task, whose dataset is extracted from Pathway Commons v12 (Rodchenko et al., 2019). We aim to predict the interaction in an entity pair in two steps: Step 1 judges whether an entity pair interacts (we manually generate an equal number of non-interacting entity pairs), and Step 2 predicts which kind of interaction the interacting entities have. Finally, the overall prediction accuracy is calculated by: $[\mathrm{Acc}(\text{Interacting}) \times \mathrm{Acc}(\text{Step 2}) + \mathrm{Acc}(\text{Non-interacting})] / 2$. The initial embeddings of all entities depend on the following five configurations: NONE stands for Glorot random initialization, while P, P + C, P + D and P + C + D stand for initialization by the pre-trained embedding results of the S, S + C, S + D and S + C + D configurations, respectively. Dataset description, model structure and training details are provided in Supplementary Appendix S6.
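The overall accuracy formula can be made concrete with a small worked example. The function name is hypothetical and the numbers in the usage note are purely illustrative, not results from this work.

```python
def overall_accuracy(acc_interacting, acc_step2, acc_non_interacting):
    """Overall accuracy of the two-step interaction prediction.

    An interacting pair counts as correct only if Step 1 detects the
    interaction and Step 2 names its type, hence the product; a
    non-interacting pair only has to pass Step 1. The two halves are
    averaged because the two classes have equal size.
    """
    return (acc_interacting * acc_step2 + acc_non_interacting) / 2
```

With illustrative values of 0.9 for interacting pairs in Step 1, 0.8 in Step 2 and 0.95 for non-interacting pairs, the overall accuracy is (0.9 × 0.8 + 0.95) / 2 = 0.835.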

Specific disease KGs
In the entity disambiguation step, BioBERT and BERT both converge within 10 fine-tuning epochs, demonstrating the strength of pre-trained language models. They achieve 91.3% and 90.6% accuracy on the validation set, respectively. Therefore, we use the entity disambiguation results of BioBERT in the following analyses (Table 2). In the relation linking step, all the constructed SDKGs have 67 relations, mapped by the relation hierarchy structure.

Entity prediction
From the MR comparisons (Table 3) and the statistical analyses by robust ANOVA and post hoc Wilcoxon tests (Supplementary Table S1), we can observe that:
1. ConvKB achieves the best performance in most conditions with statistical significance.
2. Among the 14 KGs and 3 structure embedding algorithms, 35 S + C + D configurations achieve the best performance (0 for S and S + C, 8 for S + D). However, the superiority of S + C + D over S + D lacks statistical significance.
3. PharmKG shows a comparatively slight improvement, since it is only partially annotated.
Figure 4 shows how performance (MR on Disease11) changes as $\lambda_C$ and $\lambda_D$ vary for HP and reverse-HP, respectively, with one fixed at 0 while the other changes from 0 to 0.6 (we assume that $\lambda_C$ and $\lambda_D$ are independent, because they are the weights of two separate parts). HP outperforms reverse-HP initially, but reverse-HP is consistently better as $\lambda$ increases. Compared with cross-embedding, the S + C + D configurations of both HP and reverse-HP perform better with the appropriate $\lambda_C^{*}$ and $\lambda_D^{*}$.

New knowledge inference
Using the criterion of 10% of the corresponding training set size, we finally obtain reliable new inferred knowledge of drug-gene, gene-disease and disease-drug pairs by ConvKB (S + C + D) (Supplementary Tables S2-S4). These results may be new discoveries that point to potential research directions. Table 4 lists the top 10 new inferred pairs of Disease11. For the 10 pairs of each of these three types, literature evidence can be found for 8, 9 and 9 pairs, respectively, which demonstrates the reliability of our model in discovering new knowledge. The pairs for which no evidence has been found may be findings beyond the scope of current knowledge.
Note: The best configuration under each structure embedding algorithm is noted in bold. The best configuration for each SDKG is underlined.
Note: The Rank column also ranks the pairs in the training set. Equivalent means that drug and gene refer to the same biological concept, so we treat it as a correct prediction.
For the disease-gene prediction task (Table 5), ConvKB and the three advanced network-based methods all obtain relatively small AP on the SDKGs due to domain-induced incompleteness. Nevertheless, ConvKB outperforms the network-based methods, because its relation embeddings and multimodal annotations compensate for network incompleteness. Figure 5 shows the networks containing all the entities that are linked by inferred edges. The node in the center of each network is the disease itself, which is mostly connected by existent knowledge (black edges) as expected. Our model then reasons out new potential

Application of the inferred knowledge
From the drug-disease part of Disease11 (Fig. 6A), we can observe that, in most cases, anticancer drugs are used to treat cancer and vice versa. Although non-anticancer drugs are also used extensively in cancer treatment, the reverse is not true. Most of the new inferred disease-drug pairs involve anticancer drugs treating cancer. This suggests that extending some anticancer drugs to more types of cancer will be the mainstream direction of drug repurposing. There are also plenty of potential clinical applications of repurposing non-anticancer drugs within their original field to treat non-cancerous diseases. However, the repurposing of non-anticancer drugs for carcinoma can rarely be inferred from existing knowledge, and is likely to depend mainly on disruptive discoveries beyond current knowledge. Further, we focus on {drug, gene, disease} closed-triplets for the mutual corroboration of both reliability and systematicness. A desired closed-triplet contains one node of each type and at least one new inferred edge (Supplementary Table S5). On the one hand, we can get more supporting evidence from the other two edges of the triplet; on the other hand, the triplet itself is a logically closed loop that naturally forms the proposition 'Gene associates with Disease, and Drug acts on Disease by influencing Gene (products)'. Taking an example from Cancer5, we discover a cluster of vitamin D3-centric closed-triplets (Fig. 6B). Vitamin D3 has been predicted to prevent or alleviate breast cancer, and the effects may work through genes such as IL10. Both the effects of vitamin D3 on the gene IL10 (Boontanrart et al., 2016) and the essential role of IL10 in breast cancer (Moghimi et al., 2018) have been studied. Some studies partly support our prediction that vitamin D3 may prevent or alleviate breast cancer (Wu et al., 2017), but without direct evidence.
Table 6 shows the biomolecular interaction prediction results by ConvKB on Disease11, from which we can observe that:
1. Although the NONE and P configurations have the same structure, initialization with the structure embedding achieves better prediction accuracy than random initialization. This strongly supports the universality of the pre-trained embeddings.
2. The P + C + D configuration has the highest prediction accuracy in all steps. This indicates that multimodal learning can further improve the performance of biomolecular interaction classification.
3. In Step 2, some interactions (controls-phosphorylation-of and reacts-with) have rather low prediction accuracy, mainly due to their rather small sample sizes.
Fig. 6. (A) Co-clustering result for comprehensive disease-drug pairs on the KG combining all the 11 diseases. These pairs are automatically grouped into four distinct clusters (I: anticancer drugs ↔ non-cancer diseases; II: non-anticancer drugs ↔ non-cancer diseases; III: anticancer drugs ↔ cancers; and IV: non-anticancer drugs ↔ cancers). Red and gray nodes denote inferred and existent knowledge, respectively. (B) A cluster of 15 {drug, gene, disease} closed-triplets with vitamin D3 and breast cancer nodes emerges in Cancer5 after reasoning. The inferred and existent edges are in green and black, respectively.

Discussion
SDKG-11 is constructed from biomedical literature, and the construction process produces large-scale original triplets almost without human intervention. As a highly condensed knowledge carrier, biomedical literature contains virtually all the knowledge that has been discovered and is being studied. Thus, it should be an ideal source for triplet extraction. As for the other primary source of biomedical knowledge, EMR data have low overall quality, contain a lot of unforeseen noise and vary widely across regions. However, one of the most significant advantages of EMR-based triplet extraction is that it is more clinical and closer to the real world (Li et al., 2020). Therefore, we would like to combine literature with EMRs next to build comprehensive SDKGs containing real-world data. At the model level, we evaluate TransE, TransH and ConvKB as structure embedding parts, and the experimental results show that ConvKB is the most effective one. The main reason is that ConvKB not only considers the translational characteristics of TransE, but also takes advantage of the effectiveness of the convolutional neural network. For more advanced structure embeddings, a graph neural network-based KGE model would be a solid choice, because its structure naturally fits the KG. The combination of KGs with graph convolutional networks (Schlichtkrull et al., 2018) and with graph attention networks (Che et al., 2021; Nathani et al., 2019) has been studied. We consider that using multimodal embeddings as multiple features of the graph's nodes would be a promising attempt.
As for multimodal learning, we apply reverse-HP rather than cross-embedding or HP. Cross-embedding combines structure embedding with other modal embeddings, but cannot adjust the weight of each modal embedding. Previous HP models only considered a description hyperplane, and their description embeddings were generated by a topic model (Xiao et al., 2017) or a skip-gram model (Guan et al., 2019). In this work, the weight parameters represent the degree of contribution of each annotation, which provides a certain interpretability. Both category and description annotations can improve inference (Fig. 4), because they compensate for the incompleteness of SDKGs to some extent and provide new knowledge sources for structure embedding. We also observe that description annotation is more helpful than category annotation, since the former carries much more information. Moreover, when description annotation is already used (S + D), adding category annotation (S + C + D) does not improve performance much. If we intend to consider more modal annotations, image annotation (e.g. structure schematics of proteins and drugs) is the most likely to be added (Xie et al., 2017). In addition, relation annotations (Tang et al., 2019) and multi-omics annotations (Zheng et al., 2021) are also potentially promising directions.
From the perspective of knowledge relevance, triplets from the literature of each specific disease are more focused. However, considering the completeness of information from the perspective of 'big data', we highly recommend combining all available information into one corpus, since some inferred knowledge from one SDKG may already exist in another SDKG. Next, we will try to extract knowledge from more general themes, such as 'cancer' or 'disease' to construct the initial KG.

Conclusion
In this work, we have proposed a complete specific disease KG construction and multimodal reasoning process. We have constructed SDKG-11, an SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Disease11. We have evaluated our multimodal KGE model by the entity prediction task and verified it on selected instances. We have then demonstrated the influential role of the learned embeddings in a downstream biomedical task. All of the above suggests that the new knowledge we inferred is reliable and that the embeddings we learned are universal. They can be helpful for researchers and clinical staff working on specific diseases.